feat: decoupled replay -- flow recording and independent NS3 replay #291

Open
yanzhenghao wants to merge 32 commits into aliyun:master from yanzhenghao:feat/decoupled-replay-phase1

@yanzhenghao

Summary

Implements decoupled replay: SimAI captures flow metadata and timing during coupled simulation, then an independent NS3 binary replays from the flow file without linking any SimAI code.

SimAI side

  • MockNcclGroup: flow buffer accumulation, send/completion-time recording, completion-based relative_delay_ns. Per-rank sorted map resolves prev[] rank IDs to predecessor flow IDs via lower_bound.
  • AstraSimNetwork.cc: recordFlowSendTime() in sim_send(), explicit finalize between Run and Destroy
  • Sys.cc: finalizeFlowFile() in destructor (analytical safety net)
  • entry.h: recordFlowCompletionTime() in qp_finish()
  • common.h (3 mirrored copies, kept in sync): uint64_t relative_delay_ns field
  • check-common-h-consistency.sh: CI diff-check script
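
The CI diff-check listed above is a shell script whose contents are not shown on this page. As a rough Python sketch of the same idea, assuming the check only needs to confirm that every copy is byte-identical to the first one, the comparison could be done on SHA-256 digests. The function names and the explicit path list are illustrative assumptions, not the script's actual interface.

```python
import hashlib
from pathlib import Path
from typing import List, Sequence


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_mismatches(paths: Sequence[Path]) -> List[Path]:
    """Return every path whose content differs from the first path.

    An empty result means all copies are byte-identical.
    """
    if not paths:
        return []
    reference = file_digest(paths[0])
    return [p for p in paths[1:] if file_digest(p) != reference]
```

A CI step could call `find_mismatches` with the three `common.h` paths and fail the job when the returned list is non-empty.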

Independent binary

8 files under ns-3-alibabacloud/simulation/scratch/decoupled_replay/. SetConfig() is called explicitly after ReadConf() because the independent binary has no SimAI framework to apply NS3 defaults (QCN, PFC thresholds, CC mode).

Scheduling: layer constraint (hard gate) + relative_delay_ns (soft gate). No flow-level dependency graph. Causality fully encoded in completion-based timing from Phase 1.

Co-Authored-By: Claude <noreply@anthropic.com>

Anthony and others added 30 commits April 17, 2026 17:54
- Fix curl global init thread safety: use singleton CurlGlobalManager
- Fix cross-rack detection: use global_rank_rack_map_ instead of gpus_per_server_
- Initialize WorkloadConfig members with default values
- Optimize dependency tracking from O(n²) to O(n) using map lookup
- Add error return values to OxcFlowOutput functions
- Rename static debug counters for clarity
- Add DP workload test file
- Update design document with Mermaid diagrams

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove git submodules (SimCCL/aicb/ns-3-alibabacloud) — all code in one repo
- Add ranks field to LogItem CSV output with participating GPU rank IDs
- Add sidecar _rank_mapping.csv with full rank group decomposition
- Add rank_mapper.py: CommGroup-to-RankGenerator token bridge (7 group types)
- Add _fill_ranks() in WorkloadGenerator for automatic rank population
- Add Domain Flow Graph + Domain MsgSize Bar visualization charts
- Add per-rank CSV generation script (generate_per_rank_csv.py)
- Add 15 unit tests (rank_mapper + LogItem serialization)
- Fix LRA gate to support multiple concurrent in_progress features
- Deep Interview spec + Ralplan consensus plan artifacts

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
v3 field mapping: node_ip→node_id, a_node_ip→a_node_id,
port_infos→port_id_list, port_name→port_id, server_type→chassis_topo

merger.py: N:N topology (OXC→spine→leaf fan-out, server→N leaves)
ns3_emitter.py: bandwidth from port_id (800GE→800Gbps), NPU from chassis_topo
edg_client.py: spine-aware mock crosses and smart adjustment
HomePage.tsx: frontend v3 auto-detect (server IP, bandwidth, NPU type)
lld_to_topology.py: v3 visualization with IP-based cell IDs
SimAI.conf: /etc→/tmp paths, +800Gbps rate map
SimAI_training_workload_generator.py: fix get_model_details() model→self.model

Tests: 99/99 pass, NS3 verified 8/16/32 GPU ALLREDUCE with AIOB workload

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace spine_to_leaves[spine_ip] with spine_port_to_leaf[(spine_ip, spine_port)]
for exact per-port edge resolution from LLDP data. No hardcoded formulas.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…lient

Apply re.sub(r"\(\d+\)$", "", node_id) in _build_edge_maps, resolve_paths,
_mock_baseline_crosses, and _smart_adjustment so that an OXC node_id such as "IP(0)"
and an edge a_node_id normalize to the same key whether or not the "(N)" suffix is present.
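
For illustration, the substitution named in this commit message behaves as follows. The wrapper function `normalize_node_id` and the sample IDs are hypothetical; only the regular expression itself comes from the message.

```python
import re

# Matches a trailing parenthesized integer such as "(0)" or "(12)".
_PORT_SUFFIX = re.compile(r"\(\d+\)$")


def normalize_node_id(node_id: str) -> str:
    """Strip a trailing "(N)" suffix so both ID spellings compare equal."""
    return _PORT_SUFFIX.sub("", node_id)


# Both spellings collapse to the same lookup key.
assert normalize_node_id("10.0.0.1(0)") == normalize_node_id("10.0.0.1")
```

Because the pattern is anchored with `$`, a parenthesized number in the middle of an ID is left untouched.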

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Migrate lld.json/init_crosses.json from per-session workspace storage to global
EDG_DATA_ROOT/{topology_dir}/ storage. wizard-store adds zustand/persist so that
EDG graph data survives page refreshes.

- server/config.py: add EDG_DATA_ROOT config
- server/edg/routes.py: _edg_global_dir() + _edg_load() with global-first,
  workspace-fallback strategy. init writes to both stores.
- edg-api.ts + EdgPage.tsx: pass topologyDir to baseline-graph/register-task
- wizard-store.ts: persist EDG graph data to localStorage
- feature_list.json: F091 added, F090 marked done
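
The global-first, workspace-fallback strategy mentioned for `_edg_load()` can be sketched roughly as below. This is an assumption-laden illustration: the real function's signature is not shown on this page, so `load_edg_file` and its parameters are hypothetical names.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_edg_file(name: str, global_dir: Path, workspace_dir: Path) -> Optional[Any]:
    """Load a JSON file, preferring the global store over the workspace.

    Returns the parsed JSON from the first directory that contains `name`,
    or None when neither directory has the file.
    """
    for base in (global_dir, workspace_dir):
        candidate = base / name
        if candidate.is_file():
            with candidate.open(encoding="utf-8") as fh:
                return json.load(fh)
    return None
```

Checking the global directory first means a file written by an earlier session wins over a stale per-session copy, while older workspaces that never wrote to the global store still load.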

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
v3 lld server node_id is a name (superpod#0_server#0), not an IP.
Rename across 10 files: frontend types/stores/api/pages + backend
routes/merger/tests. npu_match server_ip field preserved (external
EDG protocol contract).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- network-store setActiveNetwork now resets wizard store and clears
  ocs-sim-wizard localStorage to prevent stale graph data leakage
- lld_to_topology.py now detects group_id from lld.json and generates
  per-group pod XMLs instead of 1 pod per input file (8 pods for 8 groups)
- Fix pre-existing ntype→node_type variable reference bug in
  generate_pod_xml / generate_pod_xml_with_crosses

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Major changes:
- lld.json: rewrite to 8NPU×8port per server topology (64 leaf, 512 edges)
- ns3_emitter.py: each NPU→all leaves (not round-robin 1:1)
- merger.py: fix IP string sort→numeric sort for leaf ordering
- F091: EDG init global persistence (EDG_DATA_ROOT)
- F092: serverIps→serverIds rename (v3 lld uses names not IPs)
- network-store.ts: localStorage migration + network switch reset
- routes.py: global EDG store + server_ids params
- Various OXC/NS3 C++ fixes from previous sessions
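
To show why the merger.py change from string sort to numeric sort matters for leaf ordering, here is a small sketch. It assumes the leaf identifiers are plain IPv4 strings; the function name and the sample addresses are made up for the example.

```python
import ipaddress
from typing import Iterable, List


def sort_ips_numerically(ips: Iterable[str]) -> List[str]:
    """Sort IPv4 address strings by numeric value rather than character order."""
    return sorted(ips, key=lambda ip: int(ipaddress.IPv4Address(ip)))


leaves = ["10.0.0.10", "10.0.0.2", "10.0.0.1"]

# A plain string sort places "10.0.0.10" before "10.0.0.2".
string_order = sorted(leaves)

# The numeric sort restores the expected leaf order.
numeric_order = sort_ips_numerically(leaves)
```

Here `string_order` is `["10.0.0.1", "10.0.0.10", "10.0.0.2"]`, while `numeric_order` is `["10.0.0.1", "10.0.0.2", "10.0.0.10"]`.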

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- ns3_emitter.py: add unfolded mode (explicit spine/OXC switch nodes)
  via optional lld param. Leaf→Spine + Spine→OXC replaces Leaf↔Leaf.
  Backward compatible: no lld = folded mode.
- merger.py: fix IP string sort→numeric sort for leaf ordering
- lld.json: rewrite to 8NPU×8port×8leaf per server topology
- F091: EDG init global persistence (EDG_DATA_ROOT)
- F092: serverIps→serverIds rename + localStorage migration

Unfolded topology: 35 nodes (16NPU+16Leaf+2Spine+1OXC), 202 links.
Cross-server path: NPU→Leaf→Spine→OXC→Spine→Leaf→NPU (7 hops).
Single OXC avoids NS3 multi-path routing loops.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ment + WorkloadPage stale EDG cleanup

- ProcessList: add View cmd / View error expandable buttons with
  return_code + error_message display, color-coded status badges
- ns3_emitter: raise RuntimeError when lld has spine/OXC but unfolding
  produces zero links — no more silent fallback to folded mode
- WorkloadPage: clear edgTopologyPath/BaselineGraph/AdjustedGraph/Diff
  on new workload generation to prevent stale EDG topology leak

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… inference, CP, Chakra, tensor graph

Complete AICB workload generator extensibility implementation (F093-F099):

F093: Model registry -- MODEL_REGISTRY dict + _bootstrap.py registration
      replaces hardcoded if/elif dispatch in generate_megatron_workload.py
F094: LLaMA MockedModel -- MockedLlama.py (539 lines): GQA + SwiGLU + RMSNorm
      pre-norm, reuses MegatronColumn/RowLinear for TP. Supports LLaMA
      2/3/4 configs (7B through 70B, dense and MoE).
F095: Parameterized MoE routing -- --n_shared_expert moved from DeepSeek-only
      to get_moe_params (all MoE models). MOEMLP shared expert computation.
F096: Qwen3 inference -- MockedQwen3Moe.py (344 lines, 8 classes) +
      MockedQwen3Next.py (287 lines, 4 classes + GatedDeltaNet).
F097: Context Parallelism -- CommGroup.cp_group + ContextParallelRing (110
      lines) for ring-attention isend/irecv between CP neighbors.
F098: Chakra output format -- ChakraWriter (178 lines) converts AICB LogItem
      to MLCommons Chakra JSON schema (COMP_ONLY/COMM_COLL/COMM_SEND/COMM_RECV).
F099: Declarative tensor graph -- tensor_graph package (345 lines): TensorGraph
      CSV load/dump, ReplicateGraph layer stacking, ConnectGraph port wiring.
      SwiGLU FFN 8-line CSV template as proof-of-concept.

Also: registry.py (119 lines) + _bootstrap.py (103 lines) infrastructure,
      --num_kv_heads CLI arg for GQA architectures,
      test_registry.py and test_mocked_llama.py test files.

Research: research_aicb_extensibility.md (491 lines) -- STAGE paper analysis,
          PARAM/Chakra comparison, 2025-2026 model parallel strategy survey.

21 files, +3662/-130

Co-Authored-By: Claude <noreply@anthropic.com>
…g, Qwen3 inference, CP, Chakra, tensor graph"

This reverts commit 8604a3c.
- Fetch simulation progress via fetchProgress API for running processes
- Show progress bar with percentage, layer count, and ETA
- Extract and display workload filename from command line (-w argument)
- Restructure layout into two-row format (status+PID+buttons / progress bar)
- Add formatETA and extractWorkloadName helper functions

Co-Authored-By: Claude <noreply@anthropic.com>
All three use LLaMA-compatible architecture (RMSNorm + SwiGLU + GQA + RoPE),
reuse existing MegatronModel workload generator.
Verified parameters from deep-research HF config.json analysis.

Co-Authored-By: Claude <noreply@anthropic.com>
…letion-based timing

Adds flow recording instrumentation for NS3 decoupled replay:
- MockNcclGroup: flow buffer accumulation, send-time & completion-time recording,
  deferred finalizeFlowFile() with completion-based relative_delay_ns
- AstraSimNetwork.cc: recordFlowSendTime in sim_send, explicit finalize in main
- Sys.cc: finalizeFlowFile call in destructor (analytical mode safety net)
- common.h: relative_delay_ns field in FlowRecord
- scripts/check-common-h-consistency.sh: CI diff-check for 3 common.h copies

relative_delay_ns = send_time - max(prev completion times), clamped to 0.
No flow-level dependency graph -- causality fully encoded in timing.
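
The formula above can be written as a small function. This is a Python sketch of the arithmetic only, not the C++ code in MockNcclGroup; treating a flow with no predecessors as having a latest completion time of 0 is an assumption made for the example.

```python
from typing import Iterable


def relative_delay_ns(send_time_ns: int, prev_completion_times_ns: Iterable[int]) -> int:
    """Return send_time - max(prev completion times), clamped to 0.

    Assumption: with no predecessors the latest completion is taken as 0,
    so the result equals the absolute send time.
    """
    latest_prev = max(prev_completion_times_ns, default=0)
    return max(send_time_ns - latest_prev, 0)
```

For example, a flow sent at 1500 ns whose predecessors completed at 1000 ns and 1200 ns gets a delay of 300 ns, and a flow sent before its latest predecessor completed gets 0.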

Co-Authored-By: Claude <noreply@anthropic.com>
Independent binary (scratch/decoupled_replay/): 8 files, 2,596 lines
- Zero SimAI linkage (nm check target)
- DepScheduler with prev[] dependency graph + layer_num constraint
- flow_reader.h parses complete 21-field format
- Whitelisted in scratch/.gitignore
CI: scripts/check-common-h-consistency.sh

Co-Authored-By: Claude <noreply@anthropic.com>
Plan and all documentation reference scratch_decoupled_replay. CMakeLists.txt
had scratch_decoupled_replay_main which would break nm verification commands.

Co-Authored-By: Claude <noreply@anthropic.com>
prev[] contains rank IDs (0,1,2...) but _flow_completion_times is keyed
by flow ID (10423,10424...). The naive _flow_completion_times.count(pid)
always failed for ring allreduce flows, causing every flow to fall back
to relative_delay_ns = absolute send_time.

Build a per-rank sorted map of (flow_id, completion_time) pairs, then
resolve each prev rank ID to the most recent predecessor flow from that
rank via lower_bound. This correctly computes relative_delay_ns as
send_time - max(predecessor completion times).
Co-Authored-By: Claude <noreply@anthropic.com>
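
The rank-to-flow resolution described in this commit message can be sketched in Python, with `bisect_left` standing in for C++ `lower_bound`. This is an illustration under two assumptions: "most recent predecessor" is read as the largest flow ID below the current flow's ID, and each completed flow is recorded under a single rank. All names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

RankIndex = Dict[int, List[Tuple[int, int]]]


def build_rank_index(completions: Iterable[Tuple[int, int, int]]) -> RankIndex:
    """Group (rank, flow_id, completion_ns) records into per-rank lists sorted by flow_id."""
    index = defaultdict(list)
    for rank, flow_id, done_ns in completions:
        index[rank].append((flow_id, done_ns))
    for entries in index.values():
        entries.sort()
    return dict(index)


def predecessor_completion(index: RankIndex, prev_rank: int, flow_id: int) -> Optional[int]:
    """Return the completion time of the latest flow on prev_rank with an ID below flow_id."""
    entries = index.get(prev_rank, [])
    # Like std::lower_bound: position of the first entry whose flow_id >= flow_id.
    pos = bisect_left(entries, (flow_id,))
    if pos == 0:
        return None  # no earlier flow recorded for this rank
    return entries[pos - 1][1]
```

Stepping back one position from the lower bound yields the predecessor. A `None` result corresponds to the case where no earlier flow exists for that rank, which the caller has to handle separately.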
Cherry-picked reverted commits from submodule reflog:
- feat: decoupled replay Phase 2 (1598bbc)
- fix: GPUType enum, fct_writer format string (f0e19bb)
- refactor: inline SendFlow, remove _QPS_PER_CONNECTION_ (6843092)
- fix: sequential step numbering in main.cc (64a3613)
prev[] contains rank IDs (0,1,2...) but _flow_completion_times is keyed
by flow ID (10423,10424...). The naive _flow_completion_times.count(pid)
always failed for ring allreduce flows, causing every flow to fall back
to relative_delay_ns = absolute send_time.

Build a per-rank sorted map of (flow_id, completion_time) pairs, then
resolve each prev rank ID to the most recent predecessor flow from that
rank via lower_bound. This correctly computes relative_delay_ns as
send_time - max(predecessor completion times).
Co-Authored-By: Claude <noreply@anthropic.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ Anthony
❌ yanzhenghao


Anthony does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

yanzhenghao and others added 2 commits June 18, 2026 13:46
Co-Authored-By: Claude <noreply@anthropic.com>
2 participants